Introduction to Statistics B

Maria Anastasiadi

2025-04-25

Statistical Foundations Part 2

4. Analysis of Variance (ANOVA)

Introduction


▪️ A limitation with the t-test is that only two means can be compared at one time.

▪️ However, in many experimental set-ups we want to compare more than two means simultaneously.

▪️ Testing each pair of means with a t-test is not recommended as the probability of false positives increases with each test run.

ANOVA Definition

Analysis of Variance (ANOVA) is the recommended method for determining whether or not there is a statistically significant difference between the means of three or more independent groups.


📝 ANOVA answers the question as to whether there is greater variability between groups than within groups.

ANOVA Hypothesis


  • \[{H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \dots = \mu_{k}}\] The means are equal for each group.

  • \(H_{1}\): at least one of the means is different from the others

📝 ANOVA models facilitate the analysis of many different kinds of experimental data and they are the workhorse of basic statistical analysis.

4.1 One-Way ANOVA


One-way ANOVA predicts how the mean value of a numeric variable (the response variable) is affected by the levels of a categorical variable (the predictor variable).

These levels may represent:

a) quantitative variations (e.g. the effect of different concentrations of an antibiotic on bacterial growth).

b) qualitative variations (e.g. the effect of apple cultivar on sugar/acid ratio).

ANOVA and Linear Regression


  • The definition of ANOVA is similar to the definition of simple linear regression you have already encountered earlier.

  • In fact, ANOVA and regression are both special cases of the general linear model.

4.1.1 ANOVA Principles


  • ANOVA examines the magnitudes of three different sources of variation in the data:

A) The Total Variation: the variation among all the units in the study.

B) Between-Group Variation: the variation due to the effect of experimental treatments or control groups (explained variation).

C) Within-Group Variation: the variation due to other sources (error variation).

ANOVA in a Nutshell


ANOVA is looking at changes in variation. If the amount of variation between treatments is sufficiently large compared to the within-group variation, this suggests that the treatments are probably having an effect.


  • But how is each type of variation calculated?

Example

Consider an experiment where we compare the bioaccessibility of Vitamin D depending on the type of fibre used in a baked product.

▪️ The fibres used are: wheat, pea, and apple.

Import csv file Fibre into R

First lines of Fibre dataset
Fibre Replicate Bioaccessibility
Wheat 1 80
Wheat 2 83
Wheat 3 74
Wheat 4 86
Wheat 5 82
Wheat 6 88
Pea 1 77
Pea 2 71
Pea 3 81
Pea 4 78

Plot raw data and total mean

- The distance of each sample from the blue line represents the deviation of each measurement from the total mean. The sum of all the deviations is zero.

1. Calculate the Total Variation (SST)


  • To find the total variation we need to find the sum of squares of the deviations (SST).
## [1] 624.5
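In base R, SST can be computed directly from the deviations. The sketch below uses a small hypothetical dataset with three replicates per fibre (the full Fibre data is not reproduced here), so its numbers differ from the course output:

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# SST: squared deviations of every observation from the total (grand) mean
total.mean <- mean(fibre$Bioaccessibility)
sst <- sum((fibre$Bioaccessibility - total.mean)^2)
sst
```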

2. Calculate the Within-Groups Variation (SSE)


  • The next step is to calculate the error or residual variation.

  • First we need to calculate the means per group.

‘group.mean’ table
Fibre GM
Apple 85.00000
Pea 75.33333
Wheat 82.16667

Plot raw data, total mean & group means

- The distance of each sample from the red line is the difference of each measurement from the group mean. This is the ‘left over’ variation attributed to differences among individuals.

Within-group variability


  • To calculate the within-group variability we need to take the sum of squares for the deviations from each group mean. This is called the residual sum of squares.

Calculate sum of squares (SSE)

## [1] 328.1667
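The residual sum of squares can be sketched in base R as follows, again on a hypothetical mini dataset rather than the full Fibre data:

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# Group means, then squared deviations of each observation
# from its own group mean
group.mean <- tapply(fibre$Bioaccessibility, fibre$Fibre, mean)
sse <- sum((fibre$Bioaccessibility - group.mean[as.character(fibre$Fibre)])^2)
sse
```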

3. Calculate the Between-Group Variation (SSG)


Plot total mean and group means

- The distance of each dot from the blue line is the difference between each group mean and the Total Mean. This is variation due to differences among treatment groups.

Between Treatments Variability


  • As previously we need to calculate the sum of squares for these differences. This is a measure of the variability attributed to differences among treatments.

Calculate sum of squares (SSG)

## [1] 296.3333
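In base R this is the sum of squared differences between each group mean and the total mean, weighted by group size. A sketch on a hypothetical mini dataset (not the full Fibre data):

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# SSG: squared differences between each group mean and the total mean,
# weighted by the number of observations in each group
group.mean <- tapply(fibre$Bioaccessibility, fibre$Fibre, mean)
group.n <- tapply(fibre$Bioaccessibility, fibre$Fibre, length)
ssg <- sum(group.n * (group.mean - mean(fibre$Bioaccessibility))^2)
ssg
```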

Total Sum of Squares


📝 The Sums of Squares we calculated earlier are related by the formula:
\({SST = SSG + SSE}\)

Thus the total variation is composed of two parts, one due to groups and one due to error.
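A quick arithmetic check of this decomposition, using the sums of squares computed above for the Fibre data:

```r
# Verify SST = SSG + SSE for the Fibre example
ssg <- 296.3333   # between-group sum of squares
sse <- 328.1667   # within-group (error) sum of squares
sst <- 624.5      # total sum of squares
all.equal(sst, ssg + sse)
```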

4.1.2 Degrees of Freedom in ANOVA


  • An issue with using the sums of squares we calculated previously is that they depend on the sample size and the number of groups.

  • To standardise the sum of squares we divide by the degrees of freedom for each type of variation.

Degrees of Freedom Formula


📝 The Degrees of Freedom in ANOVA are related by the formula:
\({DFT = DFG + DFE}\)


DFT = (Number of observations - 1)

DFG = (Number of treatment groups - 1)

DFE = (Number of observations - Number of groups)
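For the Fibre example (18 observations in 3 groups) the degrees of freedom work out as:

```r
# Degrees of freedom for the Fibre example
n.obs <- 18      # number of observations
n.groups <- 3    # number of treatment groups
dft <- n.obs - 1           # total
dfg <- n.groups - 1        # between groups
dfe <- n.obs - n.groups    # within groups (error)
c(DFT = dft, DFG = dfg, DFE = dfe)
```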

4.1.3 Mean Squares


The Mean Square for each type of variation in ANOVA is calculated as:

\({MST = \frac{SST}{DFT}}\)

\({MSG = \frac{SSG}{DFG}}\)

\({MSE = \frac{SSE}{DFE}}\)

The Mean Sum of Squares is the standardised form of the Sum of Squares.
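Applying these formulas to the Fibre sums of squares and degrees of freedom gives:

```r
# Mean squares for the Fibre example
sst <- 624.5;    dft <- 17
ssg <- 296.3333; dfg <- 2
sse <- 328.1667; dfe <- 15
mst <- sst / dft
msg <- ssg / dfg
mse <- sse / dfe
round(c(MST = mst, MSG = msg, MSE = mse), 2)
```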

4.1.4 The F test

  • The final question is whether we can reject the Null Hypothesis or not.

  • To decide this we use the ANOVA F statistic.

F Statistic Definition


  • The F statistic is the ratio of MSG/MSE (the variation due to treatment over the variation due to error).

  • If \(H_{0}\) is true, the F statistic is close to 1; if the alternative hypothesis is true, it tends to be large.

The ANOVA F test


To test the Null Hypothesis in a One-Way ANOVA we calculate the F statistic:

\[{F = \frac{MSG}{MSE}}\]

F-test P-Value


The P-value of the F test is the probability that a random variable having the F(I-1, N-I) distribution (I groups, N observations) is ≥ F, the calculated value of the F statistic.

Example


  • Find the F statistic for the bioaccessibility problem and decide whether we can reject the Null Hypothesis.

1. Calculate Degrees of Freedom

## [1] 17
## [1] 2
## [1] 15

2. Find MSS

## [1] "mst= 36.74"
## [1] "msg= 148.17"
## [1] "mse= 21.88"

3. Find the F statistic

## [1] "f=msg/mse= 6.77"

The F statistic is 6.77.
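The same numbers can be reproduced in R, and pf() gives the corresponding p-value from the F distribution:

```r
# F statistic and p-value for the Fibre example (values from above)
msg <- 148.17; mse <- 21.88
dfg <- 2; dfe <- 15
f <- msg / mse
p.value <- pf(f, dfg, dfe, lower.tail = FALSE)  # P(F >= f)
round(f, 2)
```

The p-value is below 0.05, which agrees with the F-table decision below to reject the null hypothesis.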

F Table


- If we look at an F table for critical values for α=0.05, the critical value for F(2,15) is 3.68 which is smaller than our F value. So we can reject the Null hypothesis.

4.1.5 Coefficient of Determination \(R^2\)


▪️ Another statistic we can calculate from an ANOVA table is the coefficient of determination
\({R^2 = \frac{SSG}{SST}}\).

rsq <- explained.variation/total.variation
rsq
## [1] 0.4745129

▪️ This coefficient tells us that 47.5% of the total variation in Bioaccessibility is explained by the different types of fibre, while the remaining 52.5% is due to sample-to-sample variation within each group.

4.1.6 One Way Anova using R


  • Base R can carry out a One-Way ANOVA using the lm() and aov() functions.

  • There are also dedicated statistical libraries such as afex which can carry out ANOVA.

How to do a one-way ANOVA in one step:

anova1 <- aov(Bioaccessibility ~ as.factor(fibre$Fibre), data = fibre)
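A self-contained sketch on a hypothetical mini dataset (not the full Fibre data); summary() prints the complete ANOVA table:

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

anova1 <- aov(Bioaccessibility ~ as.factor(Fibre), data = fibre)
summary(anova1)  # Df, Sum Sq, Mean Sq, F value and Pr(>F)
```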

4.1.7 Assumptions for One-Way ANOVA


▪️ The continuous variable has a NORMAL distribution in ALL relevant populations (groups) or at least it doesn’t have any gross outliers.

▪️ Not as important if the sample is large (Central Limit Theorem).

▪️ If the sample is far from normal &/or small, we may need to consider alternative methods (non-parametric).

Create QQ plot for each group

QQ plots sort data in ascending order, and plot them against quantiles from a theoretical normal distribution.

Create a box-plot

4.1.8 Assumptions for the Residuals


The main assumptions the residuals need to meet are the following:

▪️ The residuals are normally distributed.

▪️ The residuals have equal variance across groups (homoscedasticity).

▪️ The residuals are independent of each other.

Check Residual Assumptions


Let’s check the assumptions for the fibre dataset.

1) Residuals QQ plot

## Draw QQ plot
plot(anova1, 2)

Visualise residuals QQ plot

No severe deviations from normality

2) Equality of Variation:

## Draw Fitted values vs Residuals plot
plot(anova1, 1)

Visualise residual variance plot

Again, we don’t see any deviations from this assumption. To make sure, we can apply Levene’s test.
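Levene’s test lives in the add-on car package (car::leveneTest()); as a base-R alternative, bartlett.test() also checks homogeneity of variances. A sketch on a hypothetical mini dataset (not the full Fibre data):

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# H0: the group variances are equal;
# a large p-value means no evidence against H0
bartlett.test(Bioaccessibility ~ Fibre, data = fibre)
```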

4.1.9 Multiple Comparisons


In the Bioaccessibility example we rejected the null hypothesis (All Means are Equal) in favour of the Alternative hypothesis (Not All Means are Equal).

▪️ This however is not very informative. We want to know which group means are statistically significantly different.

▪️ To do this we need to make multiple pairwise comparisons using t-tests.

▪️ However, when many t-tests are applied simultaneously we run the risk of false positives.

4.1.10 Post-Hoc Tests


  • To address the risk of false positives we apply Post-Hoc tests (so called because they can only be applied after we reject \(H_{0}\)).

Most Popular Post-Hoc Tests

  • Bonferroni test
  • Benjamini-Hochberg test
  • Scheffé’s test
  • Duncan’s new multiple range test
  • Tukey Honest Significant Differences test (Tukey HSD)
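Several of these corrections are available directly in base R through pairwise.t.test(); for example, Bonferroni-adjusted pairwise comparisons on a hypothetical mini dataset (not the full Fibre data):

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# Pairwise t-tests with Bonferroni correction of the p-values
pairwise.t.test(fibre$Bioaccessibility, fibre$Fibre,
                p.adjust.method = "bonferroni")
```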

Tukey Test in R


TukeyHSD(anova1) 
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Bioaccessibility ~ as.factor(fibre$Fibre), data = fibre)
## 
## $`as.factor(fibre$Fibre)`
##                  diff         lwr       upr     p adj
## Pea-Apple   -9.666667 -16.6810832 -2.652250 0.0072578
## Wheat-Apple -2.833333  -9.8477499  4.181083 0.5586372
## Wheat-Pea    6.833333  -0.1810832 13.847750 0.0567231

Which pairs of fibres differ significantly?

4.1.11 Plot the results


  • Finally we plot the means for each group in a barplot.
  • Before doing that we need to calculate the standard error (se) for each group so we can add it to each bar. Remember we can find the se from the formula: \(se = s/\sqrt{n}\)
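The group means and standard errors can be sketched in base R as follows (hypothetical mini dataset, not the full Fibre data):

```r
# Hypothetical mini dataset standing in for the full Fibre data
fibre <- data.frame(
  Fibre = rep(c("Wheat", "Pea", "Apple"), each = 3),
  Bioaccessibility = c(80, 83, 74, 77, 71, 81, 85, 88, 82)
)

# Per-group mean and standard error: se = s / sqrt(n)
group.mean <- tapply(fibre$Bioaccessibility, fibre$Fibre, mean)
group.se <- tapply(fibre$Bioaccessibility, fibre$Fibre,
                   function(x) sd(x) / sqrt(length(x)))
rbind(mean = group.mean, se = group.se)
```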

Visualise bar plot for the data